ABSTRACT

Mathematical errors in Manfred Kochen's book,
Principles of Information Retrieval, give the
reader the erroneous impression that the log-normal
distribution is a better fit than Yule's Beta-function
distribution for predicting frequency distributions
of scientific productivity of econometricians and
mathematicians. Correction of these errors suggests
that the Beta function gives a better fit.
COMMENTS ON: PRINCIPLES OF INFORMATION RETRIEVAL
by Manfred Kochen

Kochen's (1) book, Principles of Information Retrieval, was
recently reviewed in the Journal of Documentation by Vickery (2).
The review was concluded with the statement, "All who are
concerned with the future development of information science
will have something to learn from this book." Indeed, I learned
a great deal while trying to sift out misinformation from
information in the section called "Authorship." In view of the
kind endorsement by a distinguished reviewer, I hesitate to
sound a discordant note, but, in an effort to assist other
readers who might have become confused by this section, may I
point out some of the errors?

On page 85 of this book, Kochen makes the following
statement:

"It is asserted (Learnes, 1953) for example,
that the number of authors who published exactly
r papers in Econometrica over a 20-year period
is approximately a[superscript illegible]."

First, the name was Dickson H. Leavens (3), not Learnes. Second,
he did not assert any formula. He did present a table of data
giving the cumulative frequency distribution of contributors
to Econometrica and a figure showing this cumulative number of
contributors plotted versus the number of contributions on
log-log graph paper. Leavens commented that the slope of the
straight line that he had drawn in freehand with what he termed
"a fair fit for all but the last three points" was approximately
-1.5, as he expected in accordance with Pareto theory.

Third, if he had presented a formula, it would certainly
not have been a[superscript illegible] but would probably have
been ar[superscript illegible]. By curious coincidence, Sawyer (4)
gives an example of a mistake remarkably similar to this one in
his discussion of the detection of mathematical misprints. As a
matter of interest, an expression based on all but the last three
points is $868\,r^{-1.56}$. Fourth, this expression with Leavens'
exponent of approximately -1.5 is for the cumulative number of
authors who have contributed "at least r" papers in the Pareto
sense, and the factors are not those for Kochen's "exactly r"
papers in the Lotka (5) sense. For "exactly r" papers, we would
have $389\,r^{-1.94}$, based on all but the last three data points.
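
The difference between the "at least r" (Pareto) counts and the
"exactly r" (Lotka) counts can be illustrated with a short sketch.
It assumes the "exactly r" counts for econometricians listed in the
revised Table 4.1 of this letter, with "11 or more" kept as one
open-ended class; the helper names are invented for the example.

```python
# Authors with exactly r papers, r = 1..10, from the revised Table 4.1
# for econometricians; "11 or more" is kept as a separate open-ended class.
exact_counts = [436, 107, 61, 40, 14, 23, 6, 11, 1, 0]
eleven_or_more = 22


def cumulative_from_exact(exact, tail):
    """Return N(r), the number of authors with at least r papers."""
    at_least = []
    running = tail                     # authors beyond the tabulated range
    for count in reversed(exact):      # accumulate from the largest r downward
        running += count
        at_least.append(running)
    return list(reversed(at_least))


def exact_from_cumulative(at_least, tail):
    """Invert the accumulation: f(r) = N(r) - N(r + 1)."""
    shifted = at_least[1:] + [tail]
    return [n - m for n, m in zip(at_least, shifted)]


at_least = cumulative_from_exact(exact_counts, eleven_or_more)
print(at_least)

# Differencing the cumulative counts recovers the "exactly r" counts.
assert exact_from_cumulative(at_least, eleven_or_more) == exact_counts
```

The two lists are different data series, so a power law fitted to one
cannot simply be quoted for the other.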

On page 86, Table 4.1 purports to give the number of authors
contributing r papers to Econometrica over a 20-year period.
First, the column heading for r, which reads "11," is wrong.
This should be "11 or more" or "greater than 10," but not simply
"11." Second, all the numbers given, i.e., 824, 217, 94, 50, etc.,
are wrong. The correct numbers, as obtained from Leavens'
Table I, are given in a proposal for a revised Table 4.1 for
Econometricians at the end of this letter. Third, as a matter of
interest to Sherlock Holmes fans, the numbers 824, 217, 94, 50, etc.,
are not for econometricians but are estimates of the number of
physicists making important contributions in physics throughout
history up to the year 1900. These estimates for physicists
are to be found in a paper by Simon (6). In Simon's paper, a
column for "estimate" of physicists is adjacent to a column for
"actual" econometricians, and was apparently copied by mistake.
Fourth, as further evidence of the source of these erroneous
numbers (824, 217, etc.), Simon's table only goes up to "11 or
more" contributions, while Leavens' table goes right on up to
46 contributions.

On page 85, there is a statement:

"Some data has been compiled (Davis, 1941)
to show that the number of authors, in a sample
of authors who have written r papers, fits a
Yule-type distribution."

First, Davis (7) never mentioned Yule or the Yule-type distribution.
Harold T. Davis, a research associate of the Cowles Commission
for Research in Economics, examined previously published data on
scientific productivity of mathematicians, chemists, and physicists
in an effort to find statistical support for his Pareto-type law
of distribution of special abilities. Davis studied the data on
scientific productivity of mathematicians published by
Arnold Dresden (8). Davis grouped Dresden's data into intervals
of 7 centering on the central value, accumulated the number of
authors, and fitted a parabolic curve to the accumulation, obtaining

    log y = 3.74877 - 2.11012 log x

where y is the number of persons making at least x contributions.
It should be noted that the slope Davis obtained of 2.11 was much
larger than the normal 1.5 usually associated with the Pareto
law.
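
Written as a power law, and assuming that Davis's logarithms are
common (base-10) logarithms, which is not stated here, the fitted
line corresponds roughly to:

```latex
y = 10^{3.74877}\, x^{-2.11012} \approx 5.6 \times 10^{3}\, x^{-2.11}
```

That is, the cumulative count falls off with an exponent of about
2.11 rather than about 1.5.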

Second, G. Udny Yule (9) developed a probability model in 1924
in connection with analysis of the distribution of biological genera
by number of species. Herbert A. Simon studied this Beta-function
model and proposed its possible application as a distribution
function for scientific publications, calling it the "Yule"
distribution. This was in 1955, some years after Davis' book was
published in 1941.

Third, since Kochen neither described the "Yule" distribution
nor referenced Simon's application of it to scientific productivity,
it might be desirable for someone to explain what it is. Simon's
"Yule" distribution is a Beta-function probability model,

$$f(i) = A\,B(i, \rho + 1)$$

where f(i) is the number of authors contributing a
given number, i, of papers to journals,
A and ρ are constants, and
B(i, ρ + 1) is the Beta-function of i, ρ + 1.

In Simon's notation,

$$\rho = \frac{1}{1 - \alpha}, \qquad \alpha = \frac{n_k}{k},$$

where n_k is total authors and k is total papers, and

$$B(i, \rho + 1) = \int_0^1 \lambda^{i-1} (1 - \lambda)^{\rho}\, d\lambda
  = \frac{\Gamma(i)\,\Gamma(\rho + 1)}{\Gamma(i + \rho + 1)}
  \qquad (0 < i,\ 0 < \rho < \infty).$$
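
As a rough illustration of the Beta-function form, the sketch below
evaluates B(i, ρ + 1) through log-gamma functions and checks two
properties that follow from the definition: successive terms satisfy
f(i)/f(i - 1) = (i - 1)/(i + ρ), and choosing A = ρ × (total authors)
makes the f(i) add up to the total. The value ρ = 1.7 and the total of
1000 authors are illustrative inputs only, and the function names are
invented for this example.

```python
import math


def beta(a, b):
    """Beta function B(a, b) = Gamma(a) * Gamma(b) / Gamma(a + b), via log-gamma."""
    return math.exp(math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b))


def yule_frequency(i, rho, total_authors):
    """f(i) = A * B(i, rho + 1), taking A = rho * total_authors."""
    return rho * total_authors * beta(i, rho + 1.0)


rho = 1.7             # illustrative value only, not taken from Simon or Kochen
total_authors = 1000  # illustrative total

freqs = [yule_frequency(i, rho, total_authors) for i in range(1, 20001)]

# Successive terms obey f(i) / f(i - 1) = (i - 1) / (i + rho).
for i in (2, 5, 10):
    ratio = freqs[i - 1] / freqs[i - 2]
    assert math.isclose(ratio, (i - 1) / (i + rho), rel_tol=1e-9)

# The frequencies add up to (nearly) the total; what is missing is the far tail.
print(sum(freqs))
```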

Simon derived the estimated frequencies from his equations (2.11)
and (2.21) which are

$$\beta(i) = \frac{(1 - \alpha)(i - 1)}{1 + (1 - \alpha)\,i}
  = \frac{f^*(i)}{f^*(i - 1)} \qquad (i = 2, \ldots, k) \tag{2.11}$$

$$f^*(1) = \frac{n_k}{2 - \alpha} \tag{2.21}$$
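
A minimal sketch of how estimated frequencies could be generated from
a starting value and a ratio recursion of this kind is given below. It
is not a reconstruction of the program used for this letter. It assumes
the forms f*(1) = n/(2 - α) and f*(i)/f*(i - 1) = (1 - α)(i - 1)/(1 + (1 - α)i),
with n the total number of authors. Because the total number of papers
is not quoted here, α is back-solved from the tabulated first estimate
for econometricians (453.4 for 721 authors). That choice, and the
function name, are assumptions made for the example.

```python
def yule_estimates(total_authors, alpha, max_r):
    """Return estimates f*(1..max_r) and the remainder for "max_r + 1 or more".

    f*(1) = n / (2 - alpha)
    f*(i) = f*(i - 1) * (1 - alpha) * (i - 1) / (1 + (1 - alpha) * i)
    """
    c = 1.0 - alpha
    freqs = [total_authors / (2.0 - alpha)]
    for i in range(2, max_r + 1):
        freqs.append(freqs[-1] * c * (i - 1) / (1.0 + c * i))
    tail = total_authors - sum(freqs)  # what is left falls in the open-ended class
    return freqs, tail


# Econometricians: 721 authors in total (sum of the actual counts).
n_authors = 721
# ASSUMPTION: alpha is back-solved from the tabulated f*(1) = 453.4,
# since alpha = (total authors) / (total papers) cannot be formed here.
alpha = 2.0 - n_authors / 453.4

freqs, tail = yule_estimates(n_authors, alpha, max_r=10)
print([round(f, 1) for f in freqs], round(tail, 1))
```

Run this way, the output should agree closely with the computer's Yule
column in the revised table for econometricians.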


Fourth, Simon tested his Yule distribution on four sets of
scientific productivity data which he had found published earlier
by Davis and Leavens. This included Davis' collection of Dresden's
data on mathematicians and Lotka's data on chemists and physicists
as well as Leavens' data on econometricians. In Simon's paper,
the seventh column of his Table 3 contained the Yule estimates for
physicists and was adjacent to the eighth column, which contained
the actual numbers of econometricians, as was mentioned earlier.
On page 86, there is a statement,

"The log-normal distribution, on the other
hand, gives a better fit than the Yule
distribution to the data shown in Table 4.1."

First, Table 4.1 does not give estimates for mathematicians
and econometricians calculated by either the log-normal
distribution or the Yule distribution. It is impossible,
therefore, to compare either set of estimates with the actual
numbers of authors given. Second, there are no goodness-of-fit
calculations given to support the opinion that the log-normal
distribution is better than the Yule distribution. Third,
Simon, in discussing qualitatively his Yule distribution,
stated,

"The fit is reasonably good, when it is
remembered that only one parameter is available
for adjustment. However, it should be noted
that the estimated frequencies tend to be too
high for i = 1, 2 and too low for i = 3, ..., 10.
In three of the four cases, they are again too
high for the tails of the distributions. A
further refinement of the model is apparently
needed to remove these discrepancies."

Fourth, a proposed revision of Table 4.1, which includes
the estimated Yule distribution for mathematicians, is given
below. Since Simon, for some unknown reason, did not provide
a Yule estimate of the number of mathematicians making one
contribution, a computer-generated Yule distribution is given
as well as Simon's incomplete Yule distribution. A chi-square
goodness-of-fit calculation has therefore been made using
the computer's Yule distribution, giving a value of 31.5.
For 7 degrees of freedom, the 0.05 level is 14.1, which shows
that the Yule distribution is a poor fit, as Simon predicted.
However, estimates of the log-normal distribution with χ²
of 93.1 show it to be a terrible fit in contrast to Kochen's
opinion that the log-normal was better than the Yule. Finally,
a proposed revision of Table 4.1, to include the estimated
Yule distribution for econometricians, is also given below.
The computer's Yule estimates differ slightly from Simon's
estimates. Again, the Yule fit is poor but better than the
terrible fit of the log-normal estimates.
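
The goodness-of-fit arithmetic can be sketched as follows, using the
actual and Yule-estimate columns of the two revised tables. The pooling
of sparse classes is an assumption: the letter does not state which
classes were combined, and the groupings used here (r = 8 to 10 for
econometricians, r = 7 to 10 for mathematicians) were chosen because
they give the stated degrees of freedom. The critical values 15.5 and
14.1 are the ones quoted in the tables; the function and variable names
are invented for the example.

```python
def chi_square(observed, expected, groups):
    """Pearson chi-square with cells pooled according to `groups`.

    `groups` is a list of lists of row indices; each inner list is summed
    into a single cell before (O - E)^2 / E is computed.
    """
    total = 0.0
    for cell in groups:
        o = sum(observed[j] for j in cell)
        e = sum(expected[j] for j in cell)
        total += (o - e) ** 2 / e
    return total


# Rows are r = 1..10 followed by "11 or more" (index 10).
econ_actual = [436, 107, 61, 40, 14, 23, 6, 11, 1, 0, 22]
econ_simon = [453, 119, 51, 27, 16, 11, 7, 5, 4, 3, 25]
econ_computer = [453.4, 122.7, 52.3, 27.5, 16.5, 10.7, 7.4, 5.3, 4.0, 3.1, 18.1]
math_actual = [133, 43, 24, 12, 11, 14, 5, 3, 9, 1, 23]
math_computer = [158.6, 47.7, 22.0, 12.4, 7.8, 5.4, 3.9, 2.9, 2.2, 1.8, 13.4]

# ASSUMED pooling (not stated in the letter).
econ_groups = [[0], [1], [2], [3], [4], [5], [6], [7, 8, 9], [10]]   # 9 cells, 8 d.f.
math_groups = [[0], [1], [2], [3], [4], [5], [6, 7, 8, 9], [10]]     # 8 cells, 7 d.f.

econ_simon_chi2 = chi_square(econ_actual, econ_simon, econ_groups)
econ_computer_chi2 = chi_square(econ_actual, econ_computer, econ_groups)
math_computer_chi2 = chi_square(math_actual, math_computer, math_groups)

for label, value, critical in [
    ("econometricians, Simon's Yule", econ_simon_chi2, 15.5),
    ("econometricians, computer's Yule", econ_computer_chi2, 15.5),
    ("mathematicians, computer's Yule", math_computer_chi2, 14.1),
]:
    verdict = "rejected" if value > critical else "not rejected"
    print(f"{label}: chi-square = {value:.1f}, fit {verdict} at the .05 level")
```

Under these assumed groupings the three statistics come out close to
the 23.9, 25.4, and 31.5 quoted in the tables.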


REVISED TABLE 4.1 FOR MATHEMATICIANS

              Number of authors
              contributing r
              papers to the
              Chicago Section
              of American Math    Simon's     Computer's    Log-
              Society over 25-    Yule        Yule          normal
r             year period         estimate    estimate      estimate

1                  133               -          158.6         66.6
2                   43              46           47.7         27.7
3                   24              23           22.0         15.8
4                   12              14           12.4         10.4
5                   11              10            7.8          7.3
6                   14               7            5.4          9.7 (r = 6-7 combined)
7                    5               5            3.9
8                    3               4            2.9          8.4 (r = 8-10 combined)
9                    9               3            2.2
10                   1               3            1.8
11 or more          23              30           13.4         19.4

χ²                                               31.5         93.1

For 7 degrees of freedom, .05 level, R: χ² ≥ 14.1

REVISED TABLE 4.1 FOR ECONOMETRICIANS

              Number of authors
              contributing r
              papers to           Simon's     Computer's    Log-
              Econometrica over   Yule        Yule          normal
r             20-year period      estimate    estimate      estimate

1                  436             453          453.4        195.2
2                  107             119          122.7         74.9
3                   61              51           52.3         39.8
4                   40              27           27.5         24.5
5                   14              16           16.5         16.5
6                   23              11           10.7         20.5 (r = 6-7 combined)
7                    6               7            7.4
8                   11               5            5.3         16.2 (r = 8-10 combined)
9                    1               4            4.0
10                   0               3            3.1
11 or more          22              25           18.1         27.5

χ²                                 23.9          25.4        338

For 8 degrees of freedom, .05 level, R: χ² ≥ 15.5